decision score
BEDI: A Comprehensive Benchmark for Evaluating Embodied Agents on UAVs
Guo, Mingning, Wu, Mengwei, He, Jiarun, Li, Shaoxian, Li, Haifeng, Tao, Chao
With the rapid advancement of low-altitude remote sensing and Vision-Language Models (VLMs), Embodied Agents based on Unmanned Aerial Vehicles (UAVs) have shown significant potential in autonomous tasks. However, current evaluation methods for UAV-Embodied Agents (UAV-EAs) remain constrained by the lack of standardized benchmarks, diverse testing scenarios and open system interfaces. To address these challenges, we propose BEDI (Benchmark for Embodied Drone Intelligence), a systematic and standardized benchmark designed for evaluating UAV-EAs. Specifically, we introduce a novel Dynamic Chain-of-Embodied-Task paradigm based on the perception-decision-action loop, which decomposes complex UAV tasks into standardized, measurable subtasks. Building on this paradigm, we design a unified evaluation framework encompassing six core sub-skills: semantic perception, spatial perception, motion control, tool utilization, task planning and action generation. Furthermore, we develop a hybrid testing platform that incorporates a wide range of both virtual and real-world scenarios, enabling a comprehensive evaluation of UAV-EAs across diverse contexts. The platform also offers open and standardized interfaces, allowing researchers to customize tasks and extend scenarios, thereby enhancing flexibility and scalability in the evaluation process. Finally, through empirical evaluations of several state-of-the-art (SOTA) VLMs, we reveal their limitations in embodied UAV tasks, underscoring the critical role of the BEDI benchmark in advancing embodied intelligence research and model optimization. By filling the gap in systematic and standardized evaluation within this field, BEDI facilitates objective model comparison and lays a robust foundation for future development. Our benchmark is now publicly available at https://github.com/lostwolves/BEDI.
- Asia > China > Beijing > Beijing (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- North America > Canada > Ontario > National Capital Region > Ottawa (0.04)
- Workflow (1.00)
- Research Report (1.00)
- Government > Military (0.67)
- Information Technology > Robotics & Automation (0.48)
- Transportation > Marine (0.46)
- Health & Medicine > Therapeutic Area (0.46)
- North America > United States > Rhode Island > Providence County > Providence (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Nevada (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.69)
Image Quality Assessment for Embodied AI
Li, Chunyi, Xiao, Jiaohao, Zhang, Jianbo, Wen, Farong, Zhang, Zicheng, Tian, Yuan, Zhu, Xiangyang, Liu, Xiaohong, Cheng, Zhengxue, Lin, Weisi, Zhai, Guangtao
Embodied AI has developed rapidly in recent years, but it is still mainly deployed in laboratories, with various real-world distortions limiting its application. Traditionally, Image Quality Assessment (IQA) methods are applied to predict human preferences for distorted images; however, there is no IQA method to assess the usability of an image in embodied tasks, namely, the perceptual quality for robots. To provide accurate and reliable quality indicators for future embodied scenarios, we first propose the topic: IQA for Embodied AI. Specifically, we (1) constructed a perception-cognition-decision-execution pipeline based on the Mertonian system and meta-cognitive theory, and defined a comprehensive subjective score collection process; (2) established the Embodied-IQA database, containing over 36k reference/distorted image pairs, with more than 5M fine-grained annotations provided by Vision Language Models, Vision Language Action models, and real-world robots; (3) trained and validated mainstream IQA methods on Embodied-IQA, demonstrating the need to develop more accurate quality indicators for Embodied AI. We sincerely hope that through evaluation, we can promote the application of Embodied AI under complex real-world distortions. Project page: https://github.com/lcysyzxdxc/EmbodiedIQA
- Research Report (0.63)
- Workflow (0.46)
ScaleDoc: Scaling LLM-based Predicates over Large Document Collections
Zhang, Hengrui, Hui, Yulong, Liu, Yihao, Zhang, Huanchen
Predicates are foundational components in data analysis systems. However, modern workloads increasingly involve unstructured documents, which demand semantic understanding beyond traditional value-based predicates. Large Language Models (LLMs) demonstrate powerful zero-shot capabilities, but given enormous document collections and ad-hoc queries, their high inference cost leads to unacceptable overhead. Therefore, we introduce ScaleDoc, a novel system that addresses this by decoupling predicate execution into an offline representation phase and an optimized online filtering phase. In the offline phase, ScaleDoc leverages an LLM to generate semantic representations for each document. Online, for each query, it trains a lightweight proxy model on these representations to filter the majority of documents, forwarding only the ambiguous cases to the LLM for the final decision. Furthermore, ScaleDoc proposes two core innovations to achieve significant efficiency: (1) a contrastive-learning-based framework that trains the proxy model to generate reliable predicating decision scores; (2) an adaptive cascade mechanism that determines the effective filtering policy while meeting specific accuracy targets. Our evaluations across three datasets demonstrate that ScaleDoc achieves over a 2× end-to-end speedup and reduces expensive LLM invocations by up to 85%, making large-scale semantic analysis practical and efficient.
- North America > United States > California > Santa Clara County > Santa Clara (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- North America > Dominican Republic (0.04)
- Overview (1.00)
- Research Report (0.82)
- Workflow (0.68)
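The cascade idea in the ScaleDoc abstract (a cheap proxy score decides the clear cases, and only ambiguous documents reach the LLM) can be illustrated with a small sketch. This is not the ScaleDoc implementation: the two-threshold policy, the grid search over calibration scores, and the names `evaluate_policy`, `pick_thresholds`, and `run_cascade` are assumptions made for illustration, and the LLM is treated as ground truth on every document it sees.

```python
from itertools import combinations_with_replacement
from typing import Callable, List, Sequence, Tuple


def evaluate_policy(scores: Sequence[float], labels: Sequence[bool],
                    low: float, high: float) -> Tuple[float, float]:
    """Return (accuracy, forwarded fraction) on a labeled calibration sample.

    Assumption: forwarded documents count as correct because the LLM
    is treated as the reference answer.
    """
    correct = 0
    forwarded = 0
    for score, label in zip(scores, labels):
        if score < low:        # proxy rejects without calling the LLM
            correct += int(not label)
        elif score > high:     # proxy accepts without calling the LLM
            correct += int(label)
        else:                  # ambiguous band: forward to the LLM
            forwarded += 1
            correct += 1
    n = len(scores)
    return correct / n, forwarded / n


def pick_thresholds(scores: Sequence[float], labels: Sequence[bool],
                    target_accuracy: float) -> Tuple[float, float]:
    """Grid-search the (low, high) pair that forwards the fewest documents
    while still meeting the accuracy target on the calibration sample."""
    candidates = sorted(set(scores))
    best = (candidates[0], candidates[-1])  # fallback: forward everything
    best_forwarded = 1.0
    for low, high in combinations_with_replacement(candidates, 2):
        accuracy, forwarded = evaluate_policy(scores, labels, low, high)
        if accuracy >= target_accuracy and forwarded < best_forwarded:
            best, best_forwarded = (low, high), forwarded
    return best


def run_cascade(docs: Sequence[object], scores: Sequence[float],
                low: float, high: float,
                llm_predicate: Callable[[object], bool]) -> Tuple[List[bool], int]:
    """Apply the policy; return the decisions and the number of LLM calls."""
    decisions: List[bool] = []
    llm_calls = 0
    for doc, score in zip(docs, scores):
        if score < low:
            decisions.append(False)
        elif score > high:
            decisions.append(True)
        else:
            decisions.append(llm_predicate(doc))
            llm_calls += 1
    return decisions, llm_calls
```

In this sketch, tightening `target_accuracy` widens the ambiguous band and raises the number of LLM calls, which is the trade-off an adaptive cascade has to manage.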
A Neural Rejection System Against Universal Adversarial Perturbations in Radio Signal Classification
Zhang, Lu, Lambotharan, Sangarapillai, Zheng, Gan, Roli, Fabio
The advantages of deep learning over traditional methods for radio signal classification have been demonstrated in recent years. However, various researchers have discovered that even small but intentional feature perturbations, known as adversarial examples, can significantly deteriorate the performance of deep-learning-based radio signal classification. Among the various kinds of adversarial examples, the universal adversarial perturbation has gained considerable attention because it is data independent, which makes it a practical strategy for fooling radio signal classifiers with a high success rate. Therefore, in this paper, we investigate a defense system called the neural rejection system against universal adversarial perturbations, and evaluate its performance by generating white-box universal adversarial perturbations. We show that the proposed neural rejection system is able to defend against universal adversarial perturbations with significantly higher accuracy than the undefended deep neural network.
- Europe > United Kingdom > England > Leicestershire > Loughborough (0.04)
- Europe > Italy > Sardinia > Cagliari (0.04)
High-Dimensional Analysis of Bootstrap Ensemble Classifiers
Cherkaoui, Hamza, Tiomoko, Malik, Seddik, Mohamed El Amine, Louart, Cosme, Schnoor, Ekkehard, Kegl, Balazs
Bootstrap methods have long been a cornerstone of ensemble learning in machine learning. This paper presents a theoretical analysis of bootstrap techniques applied to the Least Square Support Vector Machine (LSSVM) ensemble in the context of large and growing sample sizes and feature dimensionalities. Leveraging tools from Random Matrix Theory, we investigate the performance of this classifier that aggregates decision functions from multiple weak classifiers, each trained on different subsets of the data. We provide insights into the use of bootstrap methods in high-dimensional settings, enhancing our understanding of their impact. Based on these findings, we propose strategies to select the number of subsets and the regularization parameter that maximize the performance of the LSSVM. Empirical experiments on synthetic and real-world datasets validate our theoretical results.
- Europe > Germany > Berlin (0.04)
- Europe > France (0.04)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
BCE vs. CE in Deep Feature Learning
Li, Qiufu, Xiao, Huibin, Shen, Linlin
When training classification models, one expects the learned features to be compact within classes and to separate different classes well. As the dominant loss function for training classification models, cross-entropy (CE) loss, when minimized, maximizes this compactness and distinctiveness, i.e., it reaches neural collapse (NC). Recent works show that binary CE (BCE) also performs well in multi-class tasks. In this paper, we compare BCE and CE in deep feature learning. For the first time, we prove that BCE can also maximize the intra-class compactness and inter-class distinctiveness when reaching its minimum, i.e., leading to NC. We point out that CE measures the relative values of decision scores in the model training, implicitly enhancing the feature properties by classifying samples one-by-one. In contrast, BCE measures the absolute values of decision scores and adjusts the positive/negative decision scores across all samples to uniformly high/low levels. Meanwhile, the classifier biases in BCE present a substantial constraint on the decision scores to explicitly enhance the feature properties in the training. The experimental results are aligned with the above analysis, and show that BCE could improve the classification and lead to better compactness and distinctiveness among sample features. The codes will be released.
- Asia > China > Guangdong Province > Shenzhen (0.04)
- North America > Canada (0.04)
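The contrast drawn in the BCE vs. CE abstract, that CE depends on the relative values of decision scores while BCE depends on their absolute values, can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: `ce_loss` is standard softmax cross-entropy, `bce_loss` is a one-vs-rest sum of sigmoid BCE terms without the classifier biases discussed in the abstract, and both function names are hypothetical.

```python
import math
from typing import List


def ce_loss(scores: List[float], target: int) -> float:
    """Softmax cross-entropy for one sample, computed stably."""
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[target]


def softplus(z: float) -> float:
    """Stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bce_loss(scores: List[float], target: int) -> float:
    """One-vs-rest sigmoid BCE summed over classes (assumed reduction).

    The target class contributes -log(sigmoid(s)) = softplus(-s);
    every other class contributes -log(1 - sigmoid(s)) = softplus(s).
    """
    return sum(softplus(-s) if k == target else softplus(s)
               for k, s in enumerate(scores))


if __name__ == "__main__":
    scores = [2.0, -1.0, 0.5]
    shifted = [s + 3.0 for s in scores]  # same ranking, same gaps
    # CE is unchanged by the shift; BCE is not.
    print(ce_loss(scores, 0), ce_loss(shifted, 0))
    print(bce_loss(scores, 0), bce_loss(shifted, 0))
```

Adding a constant to every decision score leaves the softmax unchanged, so CE cannot tell the two score vectors apart. BCE penalizes the shifted vector because its negative-class scores are no longer low in absolute terms.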
42e77b63637ab381e8be5f8318cc28a2-Paper.pdf
Rapid categorization paradigms have a long history in experimental psychology: Characterized by short presentation times and speeded behavioral responses, these tasks highlight the efficiency with which our visual system processes natural object categories. Previous studies have shown that feed-forward hierarchical models of the visual cortex provide a good fit to human visual decisions. At the same time, recent work in computer vision has demonstrated significant gains in object recognition accuracy with increasingly deep hierarchical architectures. But it is unclear how well these models account for human visual decisions and what they may reveal about the underlying brain processes. We have conducted a large-scale psychophysics study to assess the correlation between computational models and human behavioral responses on a rapid animal vs. non-animal categorization task. We considered visual representations of varying complexity by analyzing the output of different stages of processing in three state-of-the-art deep networks. We found that recognition accuracy increases with higher stages of visual processing (higher level stages indeed outperforming human participants on the same task) but that human decisions agree best with predictions from intermediate stages. Overall, these results suggest that human participants may rely on visual features of intermediate complexity and that the complexity of visual representations afforded by modern deep network models may exceed the complexity of those used by human participants during rapid categorization.
- North America > United States > Rhode Island > Providence County > Providence (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Nevada (0.04)
Bird's-Eye-View Scene Graph for Vision-Language Navigation
Liu, Rui, Wang, Xiaohan, Wang, Wenguan, Yang, Yi
Vision-language navigation (VLN), which requires an agent to navigate 3D environments following human instructions, has shown great advances. However, current agents are built upon panoramic observations, which hinders their ability to perceive 3D scene geometry and easily leads to ambiguous selection of panoramic views. To address these limitations, we present a BEV Scene Graph (BSG), which leverages multi-step BEV representations to encode scene layouts and geometric cues of indoor environments under the supervision of 3D detection. During navigation, BSG builds a local BEV representation at each step and maintains a BEV-based global scene map, which stores and organizes all the online collected local BEV representations according to their topological relations. Based on BSG, the agent predicts a local BEV grid-level decision score and a global graph-level decision score, combined with a sub-view selection score on panoramic views, for more accurate action prediction. Our approach significantly outperforms state-of-the-art methods on REVERIE, R2R, and R4R, showing the potential of BEV perception in VLN.
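The BSG abstract says three decision scores are combined for action prediction but does not give the fusion rule. As a rough illustration only, the sketch below assumes the three scores are already aligned to the same list of candidate actions and fuses them with a weighted sum followed by a softmax. The function name `fuse_decision_scores`, the weights, and the alignment assumption are all hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_decision_scores(local_bev: Sequence[float],
                         global_graph: Sequence[float],
                         sub_view: Sequence[float],
                         weights: Sequence[float] = (1.0, 1.0, 1.0)) -> List[float]:
    """Weighted-sum fusion of three per-candidate scores (assumed rule).

    Each input holds one score per candidate action, in the same order.
    Returns a probability distribution over the candidates.
    """
    if not (len(local_bev) == len(global_graph) == len(sub_view)):
        raise ValueError("score lists must cover the same candidates")
    w_local, w_global, w_view = weights
    fused = [w_local * l + w_global * g + w_view * v
             for l, g, v in zip(local_bev, global_graph, sub_view)]
    return softmax(fused)


if __name__ == "__main__":
    probs = fuse_decision_scores([0.2, 1.5, -0.3],
                                 [0.1, 0.4, 1.0],
                                 [0.0, 0.8, 0.2])
    best_action = max(range(len(probs)), key=probs.__getitem__)
    print(best_action, probs)
```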